Bayesian graph convolutional network with partial observations

As a widely studied model in the machine learning and data processing community, the graph convolutional network shows its advantage in processing non-grid data. However, existing graph convolutional networks generally assume that node features are fully observed. This assumption is violated in many real applications, which provide only pairwise relationships while the corresponding node features are unavailable. In this paper, a novel graph convolutional network model based on a Bayesian framework is proposed to handle graph node classification without relying on node features. First, we equip each graph node with pseudo-features generated from a stochastic process. Then, a hidden space structure preservation term is proposed and embedded into the generation process to maintain the independent and identically distributed property between the training and testing datasets. Although model inference is challenging, we derive an efficient training and prediction algorithm using variational inference. Experiments on different datasets demonstrate that the proposed graph convolutional networks significantly outperform traditional methods, achieving an average performance improvement of 9%.


Introduction
Recent years have witnessed the great success of Convolutional Neural Networks (CNNs) in many different data processing fields [1][2][3][4]. However, CNNs are primarily designed for grid data. Graphs, among the most widely used non-grid data structures in modern digital society (for example in community detection, drug design, and molecular generation [5][6][7][8][9]), are difficult to process with the CNN architecture. To overcome this difficulty, Graph Convolutional Networks (GCNs) [10][11][12] have been proposed. In a typical GCN framework, a graph is organized into two parts, the node relationships A and the node features X. Then, a deep network is applied to map the node features X into a new graph representation space constrained by the graph relationships A [13,14].
Although existing GCNs are powerful and effective tools, they assume that the graph nodes are equipped with fully observed features X. This assumption may not hold in many real applications that provide only the relationships, while the corresponding node features are unavailable (Fig 1: in a private social network, nodes have no information to display). To tackle this problem, we first construct a Bayesian GCN generative model in which pseudo features are used to simulate the real features (when applying our method to a graph with features, a concatenation strategy is used in the pseudo feature generation process). Then, a hidden structure term is proposed to generate suitable pseudo features. Finally, we derive the corresponding training and prediction algorithm. We summarize our main contributions as follows:
• The conventional GCN is extended to a Bayesian framework, where pseudo features generated from a stochastic process are used to simulate real features. Our model handles GCN graph processing tasks with or without node features in a unified framework, offering a novel perspective within a Bayesian framework on graph node classification without features.
• To maintain the independent and identically distributed property of the pseudo features, a hidden space structure preservation term is proposed and used to constrain the sample generation process.
• To handle the non-conjugacy of the model, we employ mean-field variational inference integrated with a Variational Auto-Encoder (VAE) for model training and prediction.
We organize our paper as follows. Section two covers the related work. Section three briefly reviews the preliminary knowledge. Section four presents the details of the proposed method and the corresponding variational inference algorithm. Section five reports the experimental results, including an analysis of parameter effects. Section six concludes the paper.

Node classification
Generally, GCNs can be roughly organized into two categories: spatial methods and spectral methods. In the first category, graph convolution is defined as an operation on node neighbors [15,16]. For example, Atwood et al. [15] extend Convolutional Neural Networks by employing the graph diffusion process to integrate node neighbor information. Duvenaud et al. [16] introduce the graph convolutional operation by applying a convolution-like propagation rule on graphs. Niepert et al. [17] define the GCN by converting the graph into sequences and applying the conventional CNN model. Monti et al. [18] present mixture model CNNs and use them to define a CNN model on graph data. In the second category, the spectral representation of graphs is introduced into the definition of graph convolution [19]. For example, Bruna et al. [19] construct the graph convolutional operation in the Fourier domain by exploiting the eigendecomposition of the graph Laplacian matrix. Because of the high computational complexity of this approach, Defferrard et al. [20] and Kipf et al. [21] extend Bruna's work by approximating the spectral filters with the Chebyshev expansion and with a first-order approximation of spectral graph convolution. Jiang, Tang, Li, Franceschi et al. [13,14,22,23,24] improve GCN classification accuracy by utilizing a graph learning framework. Gan, Zhao et al. [25,26] generate multiple graph structures and fuse the information of multiple graphs to improve GCN performance. Besides these theoretical analyses, many researchers focus on extending GCNs to conventional machine learning and computer vision tasks. For example, Cai et al. [27] exploit GCN to estimate 3D pose. Yan, Huang et al. employ GCNs to handle skeleton-based action recognition [28][29][30]. Yang, Wang et al. [31][32][33][34] extend the application scope of GCN to the clustering task. Zhang, Chen et al. exploit graph convolutional networks for zero-shot learning [35][36][37]. Compared to these conventional GCN models, which require node features, our model can handle graph data without this information.

Incomplete data learning
Missing data is a ubiquitous issue that has attracted much attention in data mining, machine learning and computer vision [38]. The most widely studied approach to this problem is data imputation [39], which fills in the missing attribute values with the help of the known attributes. For example, Azur et al. [40] use chained equations to iteratively impute the missing variables in the features. Mean-based methods impute the missing values by averaging the known values [38], while other methods discard the missing values directly so that the algorithm can run [41]. Although these methods have proven effective, the assumptions they make might result in biased predictions [42]. Besides these statistical methods, many recent studies turn to machine learning techniques. For example, Acuna et al. [43] exploit the k-nearest neighbors algorithm. Dick et al. [44] apply a generative model to infer the missing values. Lakshminarayan et al. [45] use decision trees. Zhang et al. exploit association-rule-based imputation and the rough set method [46,47]. More recently, deep neural networks have also been applied to missing data imputation problems [48][49][50][51]. Although these methods have achieved remarkable performance in many data imputation tasks, they assume that the features are at least partially observed. Unlike these methods, our method can be applied to the graph node classification problem without features.

Deep generative models
Deep generative models aim to model complicated real-world distributions via a deep neural network. Generally, deep generative models can be roughly categorized into two classes. The first broad class consists of variational inference and sampling based methods. For example, Variational Auto-Encoders (VAEs) [52,53] extend the Gaussian generative model with a deep neural network [54][55][56], and then use variational inference to compute the posterior probability and the likelihood function. Deep Belief Networks (DBNs) [57,58] stack multilayer restricted Boltzmann machines [59] and apply a sampling based method to maximize the likelihood. In addition to this first class, which learns the model with variational inference and sampling, the second class of deep generative models consists of implicit methods. The most typical one is the Generative Adversarial Network [60,61], which extends the maximum likelihood principle of the generative model by using an adversarial strategy. The stochastic network is another implicit method, which uses a Markov chain to construct the deep generative model [62]. Unlike the above methods, which use the generative model to fit the real distribution of the given samples, our model is designed for the graph node classification task with partially observed graph data.

Preliminary
In this section, we briefly review the preliminary knowledge of the Graph Convolutional Network (GCN) and the Variational Auto-Encoder (VAE) model. Our model will be derived in a later section.

Graph Convolutional Network
Graph Convolutional Network (GCN) extends the conventional convolution operation from grid structures to non-grid structures within the deep neural network framework. Given a graph denoted as (A, F), where A encodes the pairwise relationships and F represents the node features, GCN employs the following layer-wise propagation in hidden layers:

$$H^{h} = \sigma\left(D^{-1/2} A D^{-1/2} H^{h-1} W^{h}\right),$$

where $H^{h-1}$ and $H^{h}$ are the hidden outputs of layers $h-1$ and $h$ ($H^{0}$ is set to the input X, here the node features F), $W^{h}$ is the weight matrix of layer $h$, and D is a diagonal matrix with $d_i = \sum_{j=1}^{N} A_{i,j}$. $\sigma(\cdot)$ is the activation function, usually ReLU or the sigmoid. For the classification task, GCN defines a softmax convolution layer as the final output layer:

$$Y = \mathrm{softmax}\left(D^{-1/2} A D^{-1/2} H^{h} W\right),$$

where Y is interpreted as the label distribution. For training, GCN uses the cross-entropy loss.
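The propagation rule and softmax output layer described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `add_self_loops` flag (the renormalization used by Kipf and Welling [21], off by default here) and the toy graph are assumptions made for illustration.

```python
import numpy as np

def normalize_adjacency(A, add_self_loops=False):
    """Return D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A.

    add_self_loops=True uses A + I instead of A (an assumption borrowed from
    Kipf and Welling; the text above does not state which variant is used).
    """
    A = np.asarray(A, dtype=float)
    if add_self_loops:
        A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nonzero = d > 0
    d_inv_sqrt[nonzero] = d[nonzero] ** -0.5   # isolated nodes keep a zero row
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    """Row-wise softmax, shifted for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A, X, weights):
    """Hidden layers use ReLU; the last layer applies a row-wise softmax."""
    A_hat = normalize_adjacency(A)
    H = X
    for W in weights[:-1]:
        H = np.maximum(A_hat @ H @ W, 0.0)     # H^h = ReLU(A_hat H^{h-1} W^h)
    return softmax(A_hat @ H @ weights[-1])    # label distribution per node

# Toy example: a 4-node path graph, 5 input features, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 5))
weights = [rng.standard_normal((5, 8)), rng.standard_normal((8, 3))]
Y = gcn_forward(A, X, weights)   # each row of Y sums to 1
```

In practice the weights would be trained by minimizing the cross-entropy between the rows of Y for labeled nodes and their one-hot labels.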

Variational Auto-Encoder
Variational Auto-Encoders (VAEs), introduced by Kingma and Welling [52], are popular methods in many machine learning and computer vision tasks. Given the observation dataset $\{x_i\}$, VAE assumes that the model generates the data as follows:
(1) For the hidden variable $t_i$, draw $t_i \sim N(t \mid 0, I)$.
(2) Draw the observation sample $x_i \sim N(x \mid F(t_i, W), \sigma)$.
Here $t_i$ is the hidden variable and $F(t, W)$ is a neural network. To train the network, VAE uses variational inference and derives the following loss function:

$$\mathcal{L} = \sum_{i} \left( \mathrm{KL}\left(q(t_i) \,\|\, N(t \mid 0, I)\right) - \mathbb{E}_{q(t_i)}\left[\log N\left(x_i \mid F(t_i, W), \sigma\right)\right] \right),$$

where $q(t_i)$ is also a neural network. VAEs then optimize the above loss function via the standard back-propagation algorithm combined with a reparameterization trick. In the following, we use the same optimization trick, but with different notation for $x_i$ and $q(t_i)$.
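The reparameterization trick and the two terms of the VAE loss can be sketched in NumPy as follows. This is an illustrative sketch only: a diagonal Gaussian posterior parameterized by `mu` and `log_var`, and `sigma` treated as a standard deviation, are assumptions, and `decoder` stands in for the network $F(t, W)$.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw t = mu + std * eps with eps ~ N(0, I), so t is differentiable in mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def gaussian_log_likelihood(x, x_hat, sigma):
    """log N(x | x_hat, sigma^2 I), summed over feature dimensions."""
    d = x.shape[1]
    sq_err = np.sum((x - x_hat) ** 2, axis=1)
    return -0.5 * sq_err / sigma ** 2 - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)

def negative_elbo(x, mu, log_var, decoder, sigma, rng):
    """Single-sample Monte Carlo estimate of the VAE loss, averaged over the batch."""
    t = reparameterize(mu, log_var, rng)
    recon = gaussian_log_likelihood(x, decoder(t), sigma)
    return float(np.mean(kl_to_standard_normal(mu, log_var) - recon))

# Toy usage with a hypothetical linear decoder.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
mu, log_var = np.zeros((6, 2)), np.zeros((6, 2))
W_dec = rng.standard_normal((2, 4))
loss = negative_elbo(x, mu, log_var, lambda t: t @ W_dec, sigma=1.0, rng=rng)
```

Because the noise `eps` is drawn independently of the parameters, gradients of the loss can flow through `mu` and `log_var` when the same computation is written in an automatic differentiation framework.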

Our proposed method
In this section, we introduce our method, called Bayesian Graph Convolutional Network (BGCN), for graph node classification without features. Subsequently, we derive the corresponding training and prediction algorithm based on variational inference. The main notations and descriptions are summarized in Table 1.

Bayesian graph convolutional network
We are given the graph (A, F, Y), where $A \in \mathbb{R}^{N \times N}$, $F \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times K}$ denote the pairwise relationships, the node features and the training labels, respectively. Here, M is the number of training samples in the graph (A, F) (M < N), and K is the number of classes. We consider the problem in which F is not available. To handle this problem, our idea is to equip the input with pseudo features. One straightforward pseudo feature is a constant value. However, with a constant input the model is unable to distinguish between different samples. Another idea is to generate the pseudo features from random distributions. Although random pseudo features can distinguish samples, the features of the training set and the testing set then come from different distributions (that is, the data are no longer independent and identically distributed, Fig 2). To tackle this problem, we use the graph to constrain the pseudo feature generation process, which requires the pseudo features to be generated consistently with the given graph. Note that pseudo features are used to handle nodes without features; when the node features are available, we concatenate them with the generated pseudo features. Our BGCN generation process is:
(1) For the pseudo features $x_i$, $x_j$, draw $x_i, x_j \sim N(x \mid 0, I)$.
(2) To maintain the structure of $x_i$, $x_j$ under the graph relationship A, draw $l_{i,j} \sim N(l \mid A_{i,j}\|x_i - x_j\|^2, \sigma)$.
(3) For the labels $y_i$, $y_j$ of the pseudo features, draw $y_i, y_j \sim N(y \mid \mathrm{GCN}(x, W), \sigma)$.
Here $N(x \mid 0, I)$ is the Gaussian distribution with constant parameters 0 and I, $A_{i,j}$ is an element of A, and $\mathrm{GCN}(x, W)$ denotes the Graph Convolutional Network with parameter W. Note that, to maintain the structure in the hidden space, we set $l_{i,j} = 0$, which forces the generated pseudo features $x_i$ and $x_j$ to be consistent with the graph structure. The Probabilistic Graphical Model (PGM) is shown in Fig 3, where the pseudo features are generated from the Gaussian distribution and constrained by the graph, and the labels are generated from the pseudo features. Our model thus turns the discriminative GCN model into a generative model.
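With $l_{i,j} = 0$, step (2) of the generation process penalizes $A_{i,j}\|x_i - x_j\|^2$ for every pair of nodes. The following NumPy sketch, which assumes a symmetric A and a sum over all ordered pairs (neither is stated explicitly in the text), checks numerically that this penalty equals $2\,\mathrm{tr}(X^{T} L X)$ with $L = D - A$, which is why the graph Laplacian appears in the inference step. All names are illustrative.

```python
import numpy as np

def structure_penalty(A, X):
    """Sum over ordered pairs (i, j) of A[i, j] * ||x_i - x_j||^2."""
    diff = X[:, None, :] - X[None, :, :]          # shape (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=2)           # pairwise squared distances
    return float(np.sum(A * sq_dist))

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

# Step (1): pseudo features drawn from N(0, I) for a small symmetric graph.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 3))

pairwise = structure_penalty(A, X)
trace_form = 2.0 * float(np.trace(X.T @ graph_laplacian(A) @ X))
# pairwise and trace_form agree up to floating-point error.
```

Unconstrained Gaussian draws generally give a large penalty; features that vary smoothly over the graph give a small one, and constant features give exactly zero.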

Variational inference
In the previous section, we constructed the Bayesian graph convolutional network model. In this section, we derive the corresponding learning and prediction algorithm. Following the variational inference framework [63], we derive the Evidence Lower Bound (ELBO):

$$\log p(Y, A \mid 0, I, w) \ge \int q(X) \log \frac{p(X, Y, A)}{q(X)}\, dX,$$

where $p(Y, A \mid 0, I, w)$ denotes the joint distribution of the observations Y and A, $p(X, Y, A)$ is the joint distribution that includes the hidden variables, and $q(X)$ represents the variational posterior distribution of the pseudo features. Note that, since the label Y is generated from the pseudo features through the GCN network, $q(X)$ cannot be derived by following the standard variational framework.
In our model, we adopt a strategy used in the Variational Auto-Encoder (VAE) [55], in which the posterior of the hidden variable X is parameterized by another neural network with parameter w, $q(X) = \prod_{i=1}^{N} q_w(x_i)$. We now expand the ELBO and derive the following loss function:

where $S(x_i)$ and $u(x_i)$ are the outputs of $q_w(x_i)$, in which the first half of the output forms the mean and the last half forms the covariance. Note that integrating over the neural network has no analytical solution. Thus, we employ a sampling method to calculate this term:
where $x^{*}_{i,s}$ is sampled from the distribution $q_w(x_i)$; we set the parameter $\sigma$ to 2. If the features were available, Eq 4 could be optimized with the standard back-propagation algorithm. However, this violates our assumption and also leads to a trivial solution. A simpler method is to take the derivative with respect to the parameters $S(x_n)$ and $u(x_n)$ directly. But taking the derivative w.r.t. a neural network is challenging, and optimizing through a neural network is inefficient. Below, we show that, by using some simple constraints and the auxiliary variables $Au_i = u(x_i)$, $As_i = S(x_i)$, our method achieves an efficient solution. From the ELBO, the loss function for optimizing $u(x_i)$ is:

We impose a constraint on the variable $u(x_i)$, namely $Au\,Au^{T} = I$, with $Au = \{Au_1, \ldots, Au_i, \ldots, Au_N\}$. Rearranging the above equations, we have:

where U is defined in the same way as Au. To solve this problem, we form the Lagrangian function, set $\hat{\sigma} = 2$, set $\lambda_d$ as the Lagrangian multipliers of the first constraint, and relax $Au\,Au^{T} = UU^{T}$. Then, taking the derivative w.r.t. $Au_d$ gives:

where L is the graph Laplacian of A, $Au_d$ is the d-th row of Au, and $\lambda_u$ denotes the Lagrangian multipliers of the second constraint. The equation above can be solved by eigenvalue decomposition. For $As_n$, we have:

After obtaining the initialized $u(x_i)$, we use the standard back-propagation algorithm to further optimize $\log N(y_i \mid \mathrm{GCN}(w, x^{*}_{i,m}), \hat{\sigma})$. We summarize the BGCN training and prediction procedure in Algorithm 1. A flowchart of the proposed method is shown in Fig 4.

Training procedure
1: Compute the parameters of $q(X)$ using Eqs (6) and (7).
2: Sample $x^{*}_{i,m}$ from $q_w(x_i)$ for the training dataset.
3: Normalize the parameter $u(x_i)$.
4: If the observed feature F is available, concatenate the normalized parameter $u(x_i)$ with the observed feature F. Else, use the parameter $u(x_i)$ as the feature.
5: Use the loss of Eq 4 to train the GCN model.

Prediction procedure
1: Sample $x^{*}_{i,m}$ from $q_w(x_i)$ for the given prediction dataset.
2: Normalize the parameter $u(x_i)$.
3: If the observed feature F is available, concatenate the normalized parameter $u(x_i)$ with the observed feature F. Else, use the parameter $u(x_i)$ as the feature.
4: Use Eq (4) to obtain the GCN output.
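The feature-construction steps of the procedure above (eigenvalue decomposition, normalization, optional concatenation) can be sketched as follows. The closed-form updates of Eqs (6) and (7) are not reproduced here, so the sketch assumes the standard solution of minimizing $\mathrm{tr}(U L U^{T})$ under an orthonormality constraint, namely the eigenvectors of the Laplacian with the smallest eigenvalues. It also assumes row-wise unit-norm normalization, and it ignores the covariance $S(x_i)$ and the sampling step. Function names are hypothetical.

```python
import numpy as np

def pseudo_feature_means(A, dim):
    """Return an (N, dim) matrix whose rows play the role of u(x_i).

    Assumption: the means are the `dim` eigenvectors of L = D - A with the
    smallest eigenvalues (np.linalg.eigh returns eigenvalues in ascending order).
    """
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :dim]

def build_gcn_input(A, dim, F=None):
    """Normalize the pseudo features and concatenate them with F if F is given."""
    U = pseudo_feature_means(A, dim)
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.maximum(norms, 1e-12)              # assumed: unit L2 norm per node
    return U if F is None else np.concatenate([F, U], axis=1)

# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

X_pseudo = build_gcn_input(A, dim=2)              # graph without features
F = np.random.default_rng(0).standard_normal((6, 4))
X_concat = build_gcn_input(A, dim=2, F=F)         # graph with observed features
```

On this toy graph the second eigenvector takes one sign on the first triangle and the opposite sign on the second, which illustrates how graph-constrained pseudo features keep nodes of the same community close together. The resulting matrix would then be passed to the GCN in place of the missing features.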
Note that, when our model is applied to a graph dataset with node features, it follows from the derivation that concatenating the original features with the pseudo features is equivalent to concatenating them with the posterior parameter $u(x_i)$. The main computational cost comes from two sources: (1) the eigenvalue decomposition, which adds $O(N^3)$, where N is the number of graph nodes; (2) the GCN model. Suppose that, in the GCN model (with L layers and K iterations), each node has $m_l$-dimensional features. Then, the computational cost is

Experiments
In this section, we empirically evaluate the effectiveness of the proposed BGCN and compare it to several existing methods. Then, we measure the influence of the BGCN parameters on real graph datasets.

Experimental setup
Datasets. Three real-world graph datasets are used to evaluate the performance of our method: Citeseer, Cora and Pubmed [64]. The details of these datasets are as follows:
(1) Citeseer Dataset: Citeseer is a citation network that contains 3327 nodes, 4732 edges and six classes.
(2) Cora Dataset: The Cora dataset has 2708 nodes and 5429 edges, in which every node belongs to one of 7 classes.
(3) Pubmed Dataset: Pubmed is a dataset with 19717 nodes and 44338 edges. Each node in the dataset belongs to one of 3 classes.
In addition to the real graph datasets, we also use several image datasets (Extended YaleB, Orl, Yale, Usps, Coil20, Coil100), for which we construct the graph with the k-nearest neighbor method. The numbers of neighbors and the attributes used for these image datasets are given in Table 2. Some samples from the six image datasets are shown in Fig 6.
Experimental settings. For Citeseer, Cora and Pubmed, we follow the experimental settings in [21]. For the Coil20 dataset, we use 30 samples per class as the training dataset and the other 42 samples as the testing dataset. For the Usps dataset, we use 200 samples per class as the training dataset and the remaining samples as the testing dataset. For the Extended YaleB, Yale, and Orl datasets, we split the samples evenly, using half for training and the other half for testing. For all datasets, we use a learning rate of 0.01 and a dropout rate of 0.5. We also use an $l_2$ regularization term for weight decay. The loss function in our experiments is changed to cross entropy, which is equivalent to the least-squares loss function used in our model. We use hidden layers with 16-dimensional features and 3 layers. As the evaluation metric, we use the classification accuracy (the proportion of correctly predicted instances out of the total number of instances in the dataset). We implement our model on a computer with a XEON 4210R CPU and 62GB RAM. The GPU is an RTX 2080TI with 11GB memory. GCN is implemented with TensorFlow. The operating system used in our experiments is Linux (Ubuntu).
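The k-nearest neighbor graph construction used for the image datasets and the accuracy metric can be sketched as follows. The distance measure and the symmetrization rule are not specified in the text, so Euclidean distance and an "either direction" rule are assumed here. The function names are illustrative.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric 0/1 adjacency linking each sample to its k nearest neighbors.

    Assumptions: Euclidean distance, and an edge is kept if either endpoint
    lists the other among its k nearest neighbors.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)                # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest samples
    n = X.shape[0]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    return np.maximum(A, A.T)                     # symmetrize

def accuracy(y_true, y_pred):
    """Proportion of correctly predicted instances."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy usage: 10 flattened "images" with 4 pixels each, 3 neighbors per sample.
rng = np.random.default_rng(0)
images = rng.standard_normal((10, 4))
A_knn = knn_graph(images, k=3)
```

After symmetrization a node can have more than k neighbors, because it may be selected by nodes that are not among its own k nearest.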
Baselines. We compare our method with several other graph-based learning algorithms: 1) Label Propagation (LP) [65], 2) the DeepWalk network [66], 3) the original Graph Convolutional Network [21], 4) the Chebyshev polynomial version of the graph convolutional network [20], and 5) Graph attention networks (GAT) [67]. Note that, for the graph-based deep learning methods GCN, GAT and Chebyshev, we replace the input with a noise input (generated from a Gaussian distribution) and with a non-informative input (constant values). To investigate the influence of the graph, a simple MLP is also included in our experiments. When running our method, we equip our framework with different GCN models (GAT, GCN, and Chebyshev). The comparison with the existing baseline methods is summarized in Table 3.

Experimental results
We evaluate our method in two settings: datasets without node features and datasets with node features (results are reported in Tables 4-7). From the results, we draw the following observations:
(1) Compared to graph-based methods such as LP and DeepWalk, our GCN-based method shows a significant improvement in classification accuracy.
(2) The results of GCN with different inputs show that features play a crucial role in the GCN node classification problem. Stochastic inputs consistently result in a stochastic output.
(3) Comparing our method with the conventional GCN using different inputs, we conclude that our framework significantly improves classification accuracy.
(4) In the experiments with full features, it is not surprising that conventional GCN models perform better than the proposed approach. The reason is that our method equips the GCN model with a Bayesian framework and a hidden layer constraint term, which are difficult to optimize.
(5) Comparing the results on the datasets with features to those without features, we find that BGCN with features achieves significantly higher classification accuracy.

Effect of algorithm parameters
In this subsection, we investigate the influence of algorithm parameters under different algorithm settings.
(1) Dimension of the pseudo features' hidden variable: In this experiment, we vary the dimension of the hidden variable from 10 to 40 and conduct experiments on five real datasets (Fig 7). The results indicate that small dimensions may lead to lower classification accuracy. A likely reason is that smaller dimensions carry less information, whereas larger dimensions can capture more detailed information about the original graph structure.
(2) Number of GCN hidden units: Similar to the experiments on the dimension, we investigate the impact of the number of GCN hidden units by varying it from 10 to 40 (Fig 8). The results show that, unlike the dimension of the hidden variable, the number of hidden units has little effect on BGCN classification accuracy on Citeseer and Cora, whereas more hidden units increase the accuracy on Pubmed, Usps and Coil20. The reason may be that the Pubmed, Usps and Coil20 datasets are much more complicated than Citeseer and Cora and therefore require a more complex model.
(3) Rate of Original and Pseudo Features: For the algorithm with original features, we conduct additional experiments with various rates of original and pseudo feature dimensions. In these experiments, we decrease the rate of the original and of the pseudo feature dimensions from 100% to 10% (Figs 9 and 10). The results show that the classification accuracy on some datasets decreases as the rate of original features decreases. The reason is that the original features contain information that the pseudo features may not be able to simulate. We also observe that the classification accuracy decreases when the rate of pseudo features decreases. The reason is that, for some datasets, the pseudo features contain more information than the original features.

Convergence analysis
We evaluate the convergence of Algorithm 1 on the real graph datasets (Cora, Citeseer, Pubmed). The convergence curves are shown in Fig 11. From the figure, we conclude that our model converges after 500 iterations. We also find that the loss curve is stable on the Citeseer dataset and unstable on the Cora and Pubmed datasets. The reason is that Cora and Pubmed are much more complicated datasets than Citeseer (as is evident from the classification accuracy in Tables 6 and 7, where the accuracy for Cora and Pubmed is lower than that for Citeseer).

Conclusions
In this paper, we extend the application scope of the Graph Convolutional Network. Unlike conventional GCN methods, which require features in the input space, our method equips the GCN input with generated pseudo features and assumes that the labels are generated from the GCN within a Bayesian framework under a graph constraint. Experiments with the graph-constrained generated features demonstrate several facts: (1) random (3) although our model can handle different graph applications, it requires an eigenvalue decomposition, which is time-consuming. Thus, a fast eigenvalue decomposition would be beneficial. Additionally, in real systems, our model can replace the conventional GCN model as a plug-and-play module.

Fig 1. Private social network. In this figure, there are two communities (each color indicates one community). In this network, due to privacy considerations, nodes have no information to display. https://doi.org/10.1371/journal.pone.0307146.g001

Fig 2. Feature generation problem. The figure illustrates the pseudo features generated by two different distributions: a random distribution and a graph-constrained distribution. Pseudo features generated by the random distribution fail to preserve class relations (the same class lies in different regions of the space). However, pseudo features generated with graph constraints maintain the class relationships during the generation process (the same class lies in a similar region of the space).

Fig 3. Dependency between random parameters in our model. Probabilistic graphical model of BGCN. Specifically, consider the graph associated with the model, denoted as G. In this graph, blue nodes represent observations, while gray nodes correspond to partially observed labels. Notably, the figure shows that the observed label Y is generated from the pseudo feature $x_i$. https://doi.org/10.1371/journal.pone.0307146.g003

The full optimization procedure is summarized in Fig 5.

Algorithm 1. Training and prediction algorithm for BGCN with fully observed features.
Require: Labels $Y \in \mathbb{R}^{M}$ for the training dataset, a given graph A, and the corresponding features F.
Ensure: Labels $Y \in \mathbb{R}^{M}$ for the prediction dataset.

Fig 4. Flowchart of the proposed algorithm. Figure (A) shows the flowchart of the proposed method without node features. Figure (B) shows the flowchart of the proposed method with node features. https://doi.org/10.1371/journal.pone.0307146.g004

Fig 5. BGCN optimization framework. In figure (a), we take the node features as the input and use a neural network to infer the posterior distribution. However, since we sample $x^{*}_{i,m}$ from a neural network variational posterior distribution $q_w(x_i)$ with input F, the entire algorithm cannot be optimized. In figure (b), instead of optimizing the network with features F, we apply eigenvalue decomposition and an updating rule to obtain the mean and covariance of the output using only the graph A. When the full features F are available, we concatenate the variational posterior parameter with the given features. https://doi.org/10.1371/journal.pone.0307146.g005

Fig 8. Parameter effect. Illustration of the effect of the number of GCN hidden units. The x-axis represents the GCN hidden unit dimension, and the y-axis represents the classification accuracy. https://doi.org/10.1371/journal.pone.0307146.g008

Fig 10. Parameter effect. Illustration of the effect of the original features. The x-axis represents the rate of the original features' dimension, and the y-axis represents the classification accuracy. https://doi.org/10.1371/journal.pone.0307146.g010

Table 1. Main notations and descriptions.
Notation | Description
$Au_i$, $As_i$ | Auxiliary variables, $Au_i = u(x_i)$, $As_i = S(x_i)$
L | Laplacian matrix of A
https://doi.org/10.1371/journal.pone.0307146.t001

Table 7. Classification accuracy on the image datasets with features.
In these datasets, we construct the graph using the k-nearest neighbor graph algorithm.

Table 6. Classification accuracy on the image datasets without features.
In these datasets, we construct the graph using the k-nearest neighbor graph algorithm.